Skip to content

[Minor] add non topk benchmarks for utf8/utf8view string aggregates#21073

Open
buraksenn wants to merge 4 commits intoapache:mainfrom
buraksenn:add-non-top-k-benchmarks-to-compare
Open

[Minor] add non topk benchmarks for utf8/utf8view string aggregates#21073
buraksenn wants to merge 4 commits intoapache:mainfrom
buraksenn:add-non-top-k-benchmarks-to-compare

Conversation

@buraksenn
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Details are in #19713 but main idea is to compare non-topk and topk test results so that we can compare performances

What changes are included in this PR?

Added non topk benchmark tests.

Are these changes tested?

Only test changes

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label Mar 20, 2026
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 @buraksenn

Thanks for working on this.

@buraksenn
Copy link
Contributor Author

Thanks @kosiew for the detailed review

@buraksenn buraksenn changed the title [Minor] add non topk benchmarks for ut8/ut8view string aggregates [Minor] add non topk benchmarks for utf8/utf8view string aggregates Mar 20, 2026
Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@buraksenn

Thanks for the updates here, this is heading in a good direction. I did notice one issue that affects the benchmark results, plus a couple of smaller readability suggestions.

assert_eq!(batch.num_rows(), LIMIT);

Ok(())
Ok(format!("{}", pretty_format_batches(&batches)?))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like aggregate_string now pretty prints the collected batches and returns a String, and that is what b.iter(...) is measuring for each string benchmark.

This means the non TopK string benchmarks now include batch formatting and heap allocation overhead, which the numeric benchmarks do not include. So the comparison is no longer isolating query execution.

Would it make sense to keep a Result<()> fast path for the timed benchmark, and move the pretty_format_batches work into a one time validation helper that runs before benchmark registration?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed it such that it is done in the caller path and function returns vector

.as_str(),
|b| b.iter(|| run_string(&rt, ctx.clone(), limit, true)),
);
for asc in [false, true] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small readability thought here. Could the Utf8 vs Utf8View parity check live in a helper?

Right now criterion_benchmark is doing both benchmark registration and cross layout validation. Pulling the verification loop into something like assert_string_results_match(...) would make the benchmark matrix easier to scan.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've extracted two helpers instead of this

);
// String aggregate benchmarks
// (asc, use_topk, use_view)
let string_cases: &[(bool, bool, bool)] = &[
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The &[(bool, bool, bool)] tuple list is a bit hard to read and reason about when scanning or extending cases.

Maybe a small case struct or a helper similar to numeric_cases would help make the intent clearer. Something that names asc, use_topk, and use_view explicitly would also reduce the chance of mixing up tuple positions later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made this and numeric one both structs

@buraksenn
Copy link
Contributor Author

Thanks for the detailed review twice @kosiew, really appreciate it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add non-TopK benchmark variants for Utf8/Utf8View string aggregates

2 participants